{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Copyright 2016-present, Facebook, Inc.\n",
    "# All rights reserved.\n",
    "\n",
    "# This source code is licensed under the license found in the\n",
    "# LICENSE-examples file in the root directory of this source tree."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Study Sparse vs Dense Matrix Implementations\n",
    "Pysparnn defaults to sparse matricies but you may also use a dense matrix to improve performance\n",
    "\n",
    "This is typically when the number of dimensions is small"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# make sure you run 'python setup.py install' first!\n",
    "import pysparnn.cluster_index as ci\n",
    "import pysparnn.matrix_distance"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Get data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# feature vectors are ~10% full and there are only 100 dimensions\n",
    "features = np.random.binomial(1, 0.1, size=(100000, 100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "test_features = features[:5000]\n",
    "train_features = features[5000:]\n",
    "\n",
    "data_to_return = range(train_features.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build models to compare"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/spencebeecher/anaconda2/lib/python2.7/site-packages/pysparnn/matrix_distance.py:189: RuntimeWarning: divide by zero encountered in true_divide\n",
      "  magnitude = 1.0 / (a_root_sum_square * self.matrix_root_sum_square)\n"
     ]
    }
   ],
   "source": [
    "cp = ci.MultiClusterIndex(train_features, data_to_return)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/spencebeecher/anaconda2/lib/python2.7/site-packages/pysparnn/matrix_distance.py:334: RuntimeWarning: divide by zero encountered in true_divide\n",
      "  magnitude = 1.0 / (a_root_sum_square * self.matrix_root_sum_square)\n",
      "/Users/spencebeecher/anaconda2/lib/python2.7/site-packages/pysparnn/matrix_distance.py:336: RuntimeWarning: invalid value encountered in multiply\n",
      "  return 1 - (dprod * magnitude)\n",
      "/Users/spencebeecher/anaconda2/lib/python2.7/site-packages/pysparnn/matrix_distance.py:108: RuntimeWarning: invalid value encountered in less_equal\n",
      "  dist_filter = (dist_matrix <= max_distance)\n"
     ]
    }
   ],
   "source": [
    "dense_cp = ci.MultiClusterIndex(train_features, data_to_return, \n",
    "                                distance_type=pysparnn.matrix_distance.DenseCosineDistance)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Answer Key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pysparnn_utils"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import NearestNeighbors \n",
    "knn = NearestNeighbors()\n",
    "        \n",
    "knn.fit(train_features)\n",
    "\n",
    "# get top 3 nearest neighbors for each document\n",
    "answers = knn.kneighbors(test_features, 3, return_distance=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Performance\n",
    "Don't worry so much about the recall performance. There are many items in this space (congested). These methods should return close matches even if they arent the closest absolute matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Percent of time sparse returns a top 3 result: 0.2498\n"
     ]
    }
   ],
   "source": [
    "t0 = time.time()\n",
    "\n",
    "results = cp.search(test_features, return_distance=False)\n",
    "\n",
    "print('Percent of time sparse returns a top 3 result: {}'.format(pysparnn_utils.recall(answers, results).mean()))\n",
    "\n",
    "cp_time = time.time() - t0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Percent of time dense returns a top 3 result: 0.2458\n"
     ]
    }
   ],
   "source": [
    "t0 = time.time()\n",
    "\n",
    "results = dense_cp.search(test_features, return_distance=False)\n",
    "\n",
    "print('Percent of time dense returns a top 3 result: {}'.format(pysparnn_utils.recall(answers, results).mean()))\n",
    "\n",
    "dense_cp_time = time.time() - t0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4.979948311566905"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sparse is x times slower than dense\n",
    "cp_time / dense_cp_time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Analysis:** Equivalent performance (the indexes use random seeds for construction) and the dense version is ~4x faster in this case."
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
